Hello Jolene!
Alexey Kargin here! Nice to meet you again! Unfortunately, Olga cannot review your project right now, so I will do it instead. I've added all my comments on your project to cells with different coloring. Please do not move or delete them - it helps us move faster:
To make a dialog, let’s use this style for your remarks:
Now, let’s go to data analysis!
Hello Jolene!
My name is Olga. I'm happy to review your project today.
When I see a mistake for the first time, I will just point it out and let you find and fix it yourself. I'm trying to prepare you to work as a Data Analyst, and at a real job your team lead will do the same. But if you can't handle the task yet, I will give you a more precise hint at the next iteration.
Below you will find my comments - please do not move, modify or delete them.
You can find my comments in green, yellow or red boxes like this:
You can answer me by using this:
Project Description
Technical description
- Test name: recommender_system_test
- Expected result: within 14 days of signing up, users will show better conversion into product page views (the product_page event), product card views (product_card) and purchases (purchase). At each stage of the funnel product_page → product_card → purchase, there will be at least a 10% increase.

Project Goal

Evaluate the results of the A/B test for the funnel product_page → product_card → purchase.

Hypotheses
Tasks
Important things to consider
Description of the data
ab_project_marketing_events_us.csv — the calendar of marketing events for 2020
- name — the name of the marketing event
- regions — regions where the ad campaign will be held
- start_dt — campaign start date
- finish_dt — campaign end date

final_ab_new_users_upd_us.csv — all users who signed up in the online store from December 7 to 21, 2020
- user_id
- first_date — sign-up date
- region
- device — device used to sign up

final_ab_events_upd_us.csv — all events of the new users within the period from December 7, 2020 to January 1, 2021
- user_id
- event_dt — event date and time
- event_name — event type name
- details — additional data on the event (for instance, the order total in USD for purchase events)

final_ab_participants_upd_us.csv — table containing test participants
- user_id
- ab_test — test name
- group — the test group the user belonged to

Instructions for completing the project
Step 1. Open the data file(s) and examine the general information
Step 3. Carry out exploratory data analysis (EDA)
Step 4. Evaluate the A/B test results
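Step 4 will involve comparing conversion rates between groups A and B at each funnel stage. As a preview, here is a minimal sketch of a two-proportion z-test of the kind that could be used there; the counts are made up for illustration and are not from this dataset:

```python
import numpy as np
import scipy.stats as st

def z_test_proportions(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # pooled proportion under the null hypothesis of equal conversion
    p_pooled = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - st.norm.cdf(abs(z)))
    return z, p_value

# hypothetical counts: 300 of 1000 users convert in A, 350 of 1000 in B
z, p = z_test_proportions(300, 1000, 350, 1000)
print('z = {:.3f}, p-value = {:.4f}'.format(z, p))
```

If the p-value falls below the chosen significance level (e.g. 0.05), the difference in conversion between the groups is statistically significant.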
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from plotly import graph_objects as go
import scipy.stats as st
import math as mth
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
sns.set_palette('bright')
pd.set_option('display.max_colwidth', 450)
Plan of Action
Open and save original data as a dataframe
# read and open original data as a dataframe
try:
    marketing_events_orig = pd.read_csv('ab_project_marketing_events_us.csv')
    users_orig = pd.read_csv('final_ab_new_users_upd_us.csv')
    user_events_orig = pd.read_csv('final_ab_events_upd_us.csv')
    participants_orig = pd.read_csv('final_ab_participants_upd_us.csv')
except FileNotFoundError:
    marketing_events_orig = pd.read_csv('datasets/ab_project_marketing_events_us.csv')
    users_orig = pd.read_csv('datasets/final_ab_new_users_upd_us.csv')
    user_events_orig = pd.read_csv('datasets/final_ab_events_upd_us.csv')
    participants_orig = pd.read_csv('datasets/final_ab_participants_upd_us.csv')
Examine general info & Take a closer look at specific columns
Marketing Events
print('marketing events info:')
display(marketing_events_orig.info())
marketing events info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   name       14 non-null     object
 1   regions    14 non-null     object
 2   start_dt   14 non-null     object
 3   finish_dt  14 non-null     object
dtypes: object(4)
memory usage: 576.0+ bytes
None
print('marketing events data:')
marketing_events_orig
marketing events data:
| | name | regions | start_dt | finish_dt |
|---|---|---|---|---|
| 0 | Christmas&New Year Promo | EU, N.America | 2020-12-25 | 2021-01-03 |
| 1 | St. Valentine's Day Giveaway | EU, CIS, APAC, N.America | 2020-02-14 | 2020-02-16 |
| 2 | St. Patric's Day Promo | EU, N.America | 2020-03-17 | 2020-03-19 |
| 3 | Easter Promo | EU, CIS, APAC, N.America | 2020-04-12 | 2020-04-19 |
| 4 | 4th of July Promo | N.America | 2020-07-04 | 2020-07-11 |
| 5 | Black Friday Ads Campaign | EU, CIS, APAC, N.America | 2020-11-26 | 2020-12-01 |
| 6 | Chinese New Year Promo | APAC | 2020-01-25 | 2020-02-07 |
| 7 | Labor day (May 1st) Ads Campaign | EU, CIS, APAC | 2020-05-01 | 2020-05-03 |
| 8 | International Women's Day Promo | EU, CIS, APAC | 2020-03-08 | 2020-03-10 |
| 9 | Victory Day CIS (May 9th) Event | CIS | 2020-05-09 | 2020-05-11 |
| 10 | CIS New Year Gift Lottery | CIS | 2020-12-30 | 2021-01-07 |
| 11 | Dragon Boat Festival Giveaway | APAC | 2020-06-25 | 2020-07-01 |
| 12 | Single's Day Gift Promo | APAC | 2020-11-11 | 2020-11-12 |
| 13 | Chinese Moon Festival | APAC | 2020-10-01 | 2020-10-07 |
print('Marketing event information:')
print('number of unique events:', marketing_events_orig['name'].nunique())
print('number of regions:', marketing_events_orig['regions'].nunique())
print('number of start dates:', marketing_events_orig['start_dt'].nunique())
print('number of end dates:', marketing_events_orig['finish_dt'].nunique())
Marketing event information:
number of unique events: 14
number of regions: 6
number of start dates: 14
number of end dates: 14
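A quick way to confirm which events in the calendar overlap the data-collection window (December 7, 2020 to January 1, 2021) is a date-interval comparison. A small sketch using three of the events above, with dates copied from the table:

```python
import pandas as pd

# three events from the marketing calendar above
events = pd.DataFrame({
    'name': ['Christmas&New Year Promo', 'CIS New Year Gift Lottery',
             'Black Friday Ads Campaign'],
    'start_dt': pd.to_datetime(['2020-12-25', '2020-12-30', '2020-11-26']),
    'finish_dt': pd.to_datetime(['2021-01-03', '2021-01-07', '2020-12-01']),
})

test_start, test_end = pd.Timestamp('2020-12-07'), pd.Timestamp('2021-01-01')

# an event overlaps the window if it starts before the window ends
# and finishes after the window starts
overlapping = events[(events['start_dt'] <= test_end)
                     & (events['finish_dt'] >= test_start)]
print(overlapping['name'].tolist())
```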
Marketing events conclusion
Two marketing events overlap the data-collection period: Christmas&New Year Promo starting on 2020-12-25 and CIS New Year Gift Lottery starting on 2020-12-30. The Christmas&New Year Promo campaign was held in the EU region.

Users data
print('users info:')
display(users_orig.info())
users info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58703 entries, 0 to 58702
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     58703 non-null  object
 1   first_date  58703 non-null  object
 2   region      58703 non-null  object
 3   device      58703 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB
None
print('users data:')
users_orig.head()
users data:
| | user_id | first_date | region | device |
|---|---|---|---|---|
| 0 | D72A72121175D8BE | 2020-12-07 | EU | PC |
| 1 | F1C668619DFE6E65 | 2020-12-07 | N.America | Android |
| 2 | 2E1BF1D4C37EA01F | 2020-12-07 | EU | PC |
| 3 | 50734A22C0C63768 | 2020-12-07 | EU | iPhone |
| 4 | E1BDDCE0DAFA2679 | 2020-12-07 | N.America | iPhone |
print('number of unique users:', users_orig['user_id'].nunique())
print('number of unique dates:', users_orig['first_date'].nunique())
print('number of duplicate rows:', users_orig.duplicated().sum())
number of unique users: 58703
number of unique dates: 17
number of duplicate rows: 0
print('number of entries per date:')
users_orig['first_date'].value_counts(dropna = False)
number of entries per date:
2020-12-21    6077
2020-12-14    5448
2020-12-07    5291
2020-12-13    4181
2020-12-20    4140
2020-12-12    3591
2020-12-19    3480
2020-12-18    3238
2020-12-22    3083
2020-12-08    3017
2020-12-17    2940
2020-12-15    2924
2020-12-10    2784
2020-12-11    2226
2020-12-23    2180
2020-12-16    2093
2020-12-09    2010
Name: first_date, dtype: int64
print('number of entries per region:')
users_orig['region'].value_counts(dropna = False)
number of entries per region:
EU           43396
N.America     9051
CIS           3130
APAC          3126
Name: region, dtype: int64
print('number of entries per device:')
users_orig['device'].value_counts(dropna = False)
number of entries per device:
Android    26159
PC         14845
iPhone     11902
Mac         5797
Name: device, dtype: int64
Users data conclusion
User Events
print('new users events info:')
display(user_events_orig.info())
new users events info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423761 entries, 0 to 423760
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   user_id     423761 non-null  object
 1   event_dt    423761 non-null  object
 2   event_name  423761 non-null  object
 3   details     60314 non-null   float64
dtypes: float64(1), object(3)
memory usage: 12.9+ MB
None
print('user events:')
user_events_orig.head()
user events:
| | user_id | event_dt | event_name | details |
|---|---|---|---|---|
| 0 | E1BDDCE0DAFA2679 | 2020-12-07 20:22:03 | purchase | 99.99 |
| 1 | 7B6452F081F49504 | 2020-12-07 09:22:53 | purchase | 9.99 |
| 2 | 9CD9F34546DF254C | 2020-12-07 12:59:29 | purchase | 4.99 |
| 3 | 96F27A054B191457 | 2020-12-07 04:02:40 | purchase | 4.99 |
| 4 | 1FD7660FDF94CA1F | 2020-12-07 10:15:09 | purchase | 4.99 |
print('number of unique users:', user_events_orig['user_id'].nunique())
print('number of duplicate rows:', user_events_orig.duplicated().sum())
number of unique users: 58703
number of duplicate rows: 0
print('number of entries per event type:')
user_events_orig['event_name'].value_counts(dropna = False)
number of entries per event type:
login           182465
product_page    120862
purchase         60314
product_cart     60120
Name: event_name, dtype: int64
print('number of entries per detail:')
user_events_orig['details'].value_counts(dropna = False)
number of entries per detail:
NaN       363447
4.99       44579
9.99        9158
99.99       5405
499.99      1172
Name: details, dtype: int64
perc_details_missing = user_events_orig['details'].isnull().sum() / len(user_events_orig)
print('{:.2%} of details are missing data'.format(perc_details_missing))
85.77% of details are missing data
User Event Conclusion
The details column is missing data.

Group Participants
print('group participants info:')
display(participants_orig.info())
group participants info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14525 entries, 0 to 14524
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_id  14525 non-null  object
 1   group    14525 non-null  object
 2   ab_test  14525 non-null  object
dtypes: object(3)
memory usage: 340.6+ KB
None
print('group participants:')
participants_orig.head()
group participants:
| | user_id | group | ab_test |
|---|---|---|---|
| 0 | D1ABA3E2887B6A73 | A | recommender_system_test |
| 1 | A7A3664BD6242119 | A | recommender_system_test |
| 2 | DABC14FDDFADD29E | A | recommender_system_test |
| 3 | 04988C5DF189632E | A | recommender_system_test |
| 4 | 4FF2998A348C484F | A | recommender_system_test |
print('number of unique users in test groups:', participants_orig['user_id'].nunique())
print('{} duplicate rows'.format(participants_orig.duplicated().sum()))
in_both_groups = len(participants_orig) - participants_orig['user_id'].nunique()
print('{} entries may be in both groups'.format(in_both_groups))
number of unique users in test groups: 13638
0 duplicate rows
887 entries may be in both groups
print('number of entries per group:')
participants_orig['group'].value_counts(dropna = False)
number of entries per group:
A    8214
B    6311
Name: group, dtype: int64
print('number of entries per test:')
participants_orig['ab_test'].value_counts(dropna = False)
number of entries per test:
interface_eu_test          10850
recommender_system_test     3675
Name: ab_test, dtype: int64
print('number of group entries per test:')
# participants_orig[['ab_test', 'group']].value_counts()
participants_orig.groupby('ab_test')['group'].value_counts()
number of group entries per test:
ab_test group
interface_eu_test A 5467
B 5383
recommender_system_test A 2747
B 928
Name: group, dtype: int64
Group Participants Conclusion
The data contains two A/B tests: interface_eu_test and recommender_system_test.

Marketing events conclusion

Two marketing events overlap the data-collection period: Christmas&New Year Promo starting on 2020-12-25 and CIS New Year Gift Lottery starting on 2020-12-30. The Christmas&New Year Promo campaign was held in the EU region.

Users data conclusion

User Event Conclusion

The details column is missing data.

Group Participants Conclusion

The data contains two A/B tests: interface_eu_test and recommender_system_test.

Plan of Action
read and save a working version of data
# read and save working version of data as a dataframe
try:
    marketing_events = pd.read_csv('ab_project_marketing_events_us.csv')
    users = pd.read_csv('final_ab_new_users_upd_us.csv')
    user_events = pd.read_csv('final_ab_events_upd_us.csv')
    participants = pd.read_csv('final_ab_participants_upd_us.csv')
except FileNotFoundError:
    marketing_events = pd.read_csv('datasets/ab_project_marketing_events_us.csv')
    users = pd.read_csv('datasets/final_ab_new_users_upd_us.csv')
    user_events = pd.read_csv('datasets/final_ab_events_upd_us.csv')
    participants = pd.read_csv('datasets/final_ab_participants_upd_us.csv')
Change column names
# rename columns to be more readable
marketing_events = marketing_events.rename(columns = {'name': 'marketing_event',
                                                      'start_dt': 'marketing_start_date',
                                                      'finish_dt': 'marketing_end_date'})
users = users.rename(columns = {'first_date': 'sign_up_date'})
user_events = user_events.rename(columns = {'event_dt': 'event_datetime',
                                            'details': 'purchase_usd'})
# confirm renamed columns
print('marketing events columns:\n', marketing_events.columns, '\n')
print('users columns:\n', users.columns, '\n')
print('user events columns:\n', user_events.columns)
marketing events columns:
Index(['marketing_event', 'regions', 'marketing_start_date',
'marketing_end_date'],
dtype='object')
users columns:
Index(['user_id', 'sign_up_date', 'region', 'device'], dtype='object')
user events columns:
Index(['user_id', 'event_datetime', 'event_name', 'purchase_usd'], dtype='object')
Change datatypes
# change dates to datetime types
marketing_events['marketing_start_date'] = pd.to_datetime(marketing_events['marketing_start_date'])
marketing_events['marketing_end_date'] = pd.to_datetime(marketing_events['marketing_end_date'])
users['sign_up_date'] = pd.to_datetime(users['sign_up_date'])
user_events['event_datetime'] = pd.to_datetime(user_events['event_datetime'])
# review and confirm updated dataframes info
print('marketing event info:')
display(marketing_events.info())
print('user info:')
display(users.info())
print('user events info:')
display(user_events.info())
marketing event info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   marketing_event       14 non-null     object
 1   regions               14 non-null     object
 2   marketing_start_date  14 non-null     datetime64[ns]
 3   marketing_end_date    14 non-null     datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 576.0+ bytes
None
user info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58703 entries, 0 to 58702
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   user_id       58703 non-null  object
 1   sign_up_date  58703 non-null  datetime64[ns]
 2   region        58703 non-null  object
 3   device        58703 non-null  object
dtypes: datetime64[ns](1), object(3)
memory usage: 1.8+ MB
None
user events info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423761 entries, 0 to 423760
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   user_id         423761 non-null  object
 1   event_datetime  423761 non-null  datetime64[ns]
 2   event_name      423761 non-null  object
 3   purchase_usd    60314 non-null   float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 12.9+ MB
None
Process missing data
The user_events table has missing values in the purchase_usd column: 85.77% of entries are missing this information. Removing this much data is inappropriate, and NaNs may cause analysis issues down the line, so they are replaced with 0.

# change missing values, `NaN`s, to 0
user_events['purchase_usd'] = user_events['purchase_usd'].fillna(0.0)
print('number of missing values after processing missing values:', user_events['purchase_usd'].isnull().sum(), '\n')
print('number of entries per purchase_usd:')
display(user_events['purchase_usd'].value_counts())
number of missing values after processing missing values: 0
number of entries per purchase_usd:
0.00      363447
4.99       44579
9.99        9158
99.99       5405
499.99      1172
Name: purchase_usd, dtype: int64
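As a sanity check that filling with 0 was safe, one can confirm that the missing values occurred only for non-purchase events. A sketch on a few hypothetical rows with the same column layout:

```python
import numpy as np
import pandas as pd

# hypothetical rows shaped like user_events before the fillna
events = pd.DataFrame({
    'event_name': ['login', 'product_page', 'purchase', 'purchase', 'product_cart'],
    'purchase_usd': [np.nan, np.nan, 4.99, 99.99, np.nan],
})

# share of missing purchase_usd values per event type:
# expected to be 1.0 everywhere except 'purchase'
na_share = events.groupby('event_name')['purchase_usd'].apply(lambda s: s.isna().mean())
print(na_share)
```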
Actions Performed
- The purchase_usd column in the user_events table has missing information.
- Replaced the missing values, NaNs, with 0 to prevent any problems with data processing down the line.

Examine duplicate users in the test groups
# review unique and duplicate users in the tests
print('number of unique users in test groups:', participants['user_id'].nunique())
in_both_groups = len(participants) - participants['user_id'].nunique()
print('{} entries may be in both groups'.format(in_both_groups))
number of unique users in test groups: 13638
887 entries may be in both groups
participants.head(2)
| | user_id | group | ab_test |
|---|---|---|---|
| 0 | D1ABA3E2887B6A73 | A | recommender_system_test |
| 1 | A7A3664BD6242119 | A | recommender_system_test |
participants['ab_test'].value_counts()
interface_eu_test          10850
recommender_system_test     3675
Name: ab_test, dtype: int64
print('number of duplicate users appearing in both A/B tests `interface_eu_test` and `recommender_system_test`:', participants['user_id'].duplicated().sum())
number of duplicate users appearing in both A/B tests `interface_eu_test` and `recommender_system_test`: 887
test_duplicates_are_in = participants[participants['user_id'].duplicated()]
print('number of duplicate `user_id`s in data with both tests present, broken down by group A or B:')
print(test_duplicates_are_in.groupby('ab_test')['group'].value_counts())
# display(test_duplicates_are_in.head())
number of duplicate `user_id`s in data with both tests present, broken down by group A or B:
ab_test group
interface_eu_test A 456
B 431
Name: group, dtype: int64
# create queries for each ab_test name
rec_test_query = participants.query('ab_test == "recommender_system_test"')
eu_test_query = participants.query('ab_test == "interface_eu_test"')
# check for duplicates within each ab_test name
print('number of duplicate users in `recommender_system_test`:', rec_test_query['user_id'].duplicated().sum())
print('number of duplicate users in `interface_eu_test`:', eu_test_query['user_id'].duplicated().sum())
number of duplicate users in `recommender_system_test`: 0
number of duplicate users in `interface_eu_test`: 0
# the reviewer's cell: let's check the users who took two tests at the same time
participants.groupby('user_id').agg({'ab_test' : 'nunique'}).query('ab_test > 1')
| | ab_test |
|---|---|
| user_id | |
| 001064FEAAB631A1 | 2 |
| 00341D8401F0F665 | 2 |
| 0082295A41A867B5 | 2 |
| 00E68F103C66C1F7 | 2 |
| 00EFA157F7B6E1C4 | 2 |
| ... | ... |
| FEA0C585A53E7027 | 2 |
| FEC0BCA6C323872F | 2 |
| FF2174A1AA0EAD20 | 2 |
| FF44696E39039D29 | 2 |
| FFF28D02B1EACBE1 | 2 |
887 rows × 1 columns
# the reviewer's cell: let's check the users who took two tests at the same time
tmp = participants.groupby('user_id').agg({'ab_test' : 'nunique'}).query('ab_test > 1').reset_index()
participants.query('ab_test == "recommender_system_test" and user_id in @tmp.user_id').groupby('group').agg(user_from_sec_test = ('user_id', 'nunique')).reset_index().merge(
participants.query('ab_test == "recommender_system_test"').groupby('group').agg(total_user = ('user_id', 'nunique')).reset_index(), on = 'group')
| | group | user_from_sec_test | total_user |
|---|---|---|---|
| 0 | A | 665 | 2747 |
| 1 | B | 222 | 928 |
# student breakdown of the reviewer's cell: let's check the users who took two tests at the same time
# find the number of users that are in 2 ab tests per group
users_from_rec_test_in_2_tests = participants.query('ab_test == "recommender_system_test" and user_id in @tmp.user_id').groupby('group').agg(user_from_rec_test = ('user_id', 'nunique')).reset_index()
users_from_eu_test_in_2_tests = participants.query('ab_test == "interface_eu_test" and user_id in @tmp.user_id').groupby('group').agg(user_from_eu_test = ('user_id', 'nunique')).reset_index()
num_uni_users_in_rec_test = participants.query('ab_test == "recommender_system_test"').groupby('group').agg(total_user = ('user_id', 'nunique')).reset_index()
# .merge(
# , on = 'group') interface_eu_test
display(users_from_rec_test_in_2_tests)
display(users_from_eu_test_in_2_tests)
display(num_uni_users_in_rec_test)
| | group | user_from_rec_test |
|---|---|---|
| 0 | A | 665 |
| 1 | B | 222 |
| | group | user_from_eu_test |
|---|---|---|
| 0 | A | 456 |
| 1 | B | 431 |
| | group | total_user |
|---|---|---|
| 0 | A | 2747 |
| 1 | B | 928 |
# participants
# student breakdown of the reviewer's cell:
tmp = participants.groupby('user_id').agg({'ab_test' : 'nunique'}).query('ab_test > 1').reset_index()
# find users in the eu test that are part of 2 ab tests, split into group B and group A
tmp_B = participants.query('ab_test == "interface_eu_test" and user_id in @tmp.user_id and group == "B"')
tmp_A = participants.query('ab_test == "interface_eu_test" and user_id in @tmp.user_id and group == "A"')
# find users in the rec test that are from group B of eu test that are in 2 ab tests
group_b_overlap = participants.query('ab_test == "recommender_system_test" and user_id in @tmp_B.user_id').groupby('group').agg(user_from_sec_test_B_group = ('user_id', 'nunique')).reset_index()
display(group_b_overlap)
group_a_overlap = participants.query('ab_test == "recommender_system_test" and user_id in @tmp_A.user_id').groupby('group').agg(user_from_sec_test_A_group = ('user_id', 'nunique')).reset_index()
display(group_a_overlap)
# .merge(
# participants.query('ab_test == "recommender_system_test"').groupby('group').agg(total_user = ('user_id', 'nunique')).reset_index(), on = 'group').assign(
# proprtion = lambda x: x['user_from_sec_test_B_group'] / x['total_user'])
| | group | user_from_sec_test_B_group |
|---|---|---|
| 0 | A | 325 |
| 1 | B | 106 |
| | group | user_from_sec_test_A_group |
|---|---|---|
| 0 | A | 340 |
| 1 | B | 116 |
# the reviewer's cell:
tmp = participants.groupby('user_id').agg({'ab_test' : 'nunique'}).query('ab_test > 1').reset_index()
tmp_B = participants.query('ab_test == "interface_eu_test" and user_id in @tmp.user_id and group == "B"')
participants.query('ab_test == "recommender_system_test" and user_id in @tmp_B.user_id').groupby('group').agg(user_from_sec_test_B_group = ('user_id', 'nunique')).reset_index().merge(participants.query('ab_test == "recommender_system_test"').groupby('group').agg(total_user = ('user_id', 'nunique')).
reset_index(), on = 'group').assign(proprtion = lambda x: x['user_from_sec_test_B_group'] / x['total_user'])
| | group | user_from_sec_test_B_group | total_user | proprtion |
|---|---|---|---|---|
| 0 | A | 325 | 2747 | 0.118311 |
| 1 | B | 106 | 928 | 0.114224 |
# ### student notes to break down the code in cell above
# # the reviewer's cell:
# ### find users that were in more than one test
# tmp = participants.groupby('user_id').agg({'ab_test' : 'nunique'}).query('ab_test > 1').reset_index()
# ### from users that were in more than one test, find those that were in group B only of the eu test
# tmp_B = participants.query('ab_test == "interface_eu_test" and user_id in @tmp.user_id and group == "B"')
# ### get the eu group B users that were in two tests from the rec test
# participants.query('ab_test == "recommender_system_test" and user_id in @tmp_B.user_id').
# ### separate the users above into which test group they were in A or B, name a new column and find the unique # of users
# groupby('group').agg(user_from_sec_test_B_group = ('user_id', 'nunique')).reset_index().
# ### merge the small table with the number of total users per group in the rec test
# merge(participants.query('ab_test == "recommender_system_test"').groupby('group').agg(total_user = ('user_id', 'nunique')).
# reset_index(), on = 'group').
# ### determine the % of eu test group B users that were in group A and B of the rec test
# assign(proprtion = lambda x: x['user_from_sec_test_B_group'] / x['total_user'])
# ### results: the proportion of users that saw the eu test (group B) is about the same between the two groups (A and B) in the rec test
# ### because the proportions are about equal, it can be suggested that any results in rec test group B are not due to users also being in the eu test
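The chained expression those notes unpack can be reproduced on toy data. A minimal sketch of the groupby → agg → merge → assign pattern, with made-up users and overlap:

```python
import pandas as pd

# toy participants table: five users split into groups A and B
participants = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'group':   ['A',  'A',  'B',  'B',  'A'],
})
# hypothetical subset of users who also took a second test
overlap_ids = ['u1', 'u3']

# per-group count of overlapping users
overlap = (participants.query('user_id in @overlap_ids')
           .groupby('group')
           .agg(overlap_users=('user_id', 'nunique'))
           .reset_index())
# per-group total count of users
totals = (participants.groupby('group')
          .agg(total_users=('user_id', 'nunique'))
          .reset_index())

# merge the two summaries and compute the overlap share per group
result = overlap.merge(totals, on='group').assign(
    proportion=lambda x: x['overlap_users'] / x['total_users'])
print(result)
```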
# examine the duplicate users
duplicate_users = participants[participants['user_id'].duplicated()]
print('duplicate users table:')
display(duplicate_users.head())
# evaluate which test the duplicates are in
duplicate_test_group_breakdown = duplicate_users.groupby(['ab_test', 'group'])['user_id'].nunique().reset_index()
duplicate_test_group_breakdown.columns = ['ab_test', 'group', 'num_users_per_group']
display(duplicate_test_group_breakdown)
duplicate users table:
| | user_id | group | ab_test |
|---|---|---|---|
| 3677 | FB6F6BC119E1DBD5 | B | interface_eu_test |
| 3688 | 055A4CD17A483B8E | A | interface_eu_test |
| 3695 | 7060BB1CBE3BAACC | B | interface_eu_test |
| 3711 | 8A26EBC1BA92524B | B | interface_eu_test |
| 3730 | A791C2D62B213C96 | B | interface_eu_test |
| | ab_test | group | num_users_per_group |
|---|---|---|---|
| 0 | interface_eu_test | A | 456 |
| 1 | interface_eu_test | B | 431 |
# calculate number of duplicates in each group
num_dup_a = len(duplicate_users.query('group == "A"'))
num_dup_b = len(duplicate_users.query('group == "B"'))
# calculate the number of participants in each group in interface_eu_test
num_a = len(participants.query('group == "A" & ab_test == "interface_eu_test"'))
num_b = len(participants.query('group == "B" & ab_test == "interface_eu_test"'))
# calculate the percentage of duplicates in each group
perc_dup_a = num_dup_a / num_a
perc_dup_b = num_dup_b / num_b
print('{:.2%} of group A entries are duplicates'.format(perc_dup_a))
print('{:.2%} of group B entries are duplicates'.format(perc_dup_b))
8.34% of group A entries are duplicates
8.01% of group B entries are duplicates
Conclusion: Examine duplicate users in the test groups
The duplicate users all appear in interface_eu_test and fall within 10% of the data for each group.

Remove duplicate users from the tests
# create a list of the user ids of the duplicates
duplicate_user_ids_list = list(duplicate_users['user_id'])
# create a new table without the duplicates
participants_clean = participants.query('user_id not in @duplicate_user_ids_list')
print('number of duplicate ids after removal:', participants_clean['user_id'].duplicated().sum())
number of duplicate ids after removal: 0
Does the data contain 15% of new users from the EU region?
Audience: 15% of the new users from the EU region. Confirm that the users in the test data are from the EU region.

# merge tables on 'user_id' to find which region the test participants are from
test_participants = pd.merge(participants_clean, users, on = 'user_id')
test_participants.head()
| | user_id | group | ab_test | sign_up_date | region | device |
|---|---|---|---|---|---|---|
| 0 | D1ABA3E2887B6A73 | A | recommender_system_test | 2020-12-07 | EU | PC |
| 1 | A7A3664BD6242119 | A | recommender_system_test | 2020-12-20 | EU | iPhone |
| 2 | 4FF2998A348C484F | A | recommender_system_test | 2020-12-20 | EU | Mac |
| 3 | 7473E0943673C09E | A | recommender_system_test | 2020-12-16 | EU | iPhone |
| 4 | C46FE336D240A054 | A | recommender_system_test | 2020-12-17 | EU | iPhone |
test_participants.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12751 entries, 0 to 12750
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   user_id       12751 non-null  object
 1   group         12751 non-null  object
 2   ab_test       12751 non-null  object
 3   sign_up_date  12751 non-null  datetime64[ns]
 4   region        12751 non-null  object
 5   device        12751 non-null  object
dtypes: datetime64[ns](1), object(5)
memory usage: 697.3+ KB
# determine which region the test participants are in
print('number of test participants per region per test:')
region_test_breakdown = test_participants.groupby(['ab_test', 'group'])['region'].value_counts()#.reset_index()
# region_test_breakdown.columns = ['ab_test', 'group', 'region', 'num_users']
region_test_breakdown
number of test participants per region per test:
ab_test group region
interface_eu_test A EU 5011
B EU 4952
recommender_system_test A EU 1939
N.America 96
APAC 28
CIS 19
B EU 655
N.America 23
APAC 17
CIS 11
Name: region, dtype: int64
num_participants_in_eu_test = len(test_participants.query('ab_test == "interface_eu_test"'))
num_new_users = len(users)
perc_eu_test_participants_of_new_users = num_participants_in_eu_test / num_new_users
print('{:.2%} of new users are `interface_eu_test` participants; 15% is expected'.format(perc_eu_test_participants_of_new_users))
16.97% of new users are `interface_eu_test` participants; 15% is expected
Conclusion: Does the data contain 15% of new users from the EU region?
Audience: 15% of the new users from the EU region - confirm that the users in the test data are from the EU region.

- All interface_eu_test participants are from the EU region.
- recommender_system_test contains users from several regions.
- 16.97% of new users are interface_eu_test participants; 15% is expected.

# calculate the number of users in the recommender_system_test in the EU region
num_eu_participants_in_rec_test = len(test_participants.query('ab_test == "recommender_system_test" & region == "EU"'))
# calculate the percentage of new users in the recommender_system_test from the EU
perc_eu_participants_in_rec_test_of_new_users = num_eu_participants_in_rec_test / num_new_users
print('{:.2%} of new users in the `recommender_system_test` are from the EU; 15% is expected'.format(perc_eu_participants_in_rec_test_of_new_users))
4.42% of new users in the `recommender_system_test` are from the EU; 15% is expected
Review the first and last data entry collected in data set
# review the first and last days of data collection in the data set
print(user_events['event_datetime'].min(), ':first data entry collected in data set')
print(user_events['event_datetime'].max(), ':last data entry collected in data set')
2020-12-07 00:00:33 :first data entry collected in data set
2020-12-30 23:36:33 :last data entry collected in data set
Does the data cover the appropriate number of days?
# review number of unique sign up dates for all collection dates
print('number of unique sign up dates for all data collection dates:', test_participants['sign_up_date'].nunique())
# create a query to limit the data to test participants that have a full 14 days of collection data
test_participants_14_days_data = test_participants.query('sign_up_date <= "2020-12-16"')
print('number of unique sign up dates for participants that have a full 14 days of collection data:', test_participants_14_days_data['sign_up_date'].nunique())
num_test_participants = test_participants['user_id'].nunique()
num_test_participants_14_days_data = test_participants_14_days_data['user_id'].nunique()
perc_data_removed = 1 - num_test_participants_14_days_data / num_test_participants
print('{:.2%} of users were removed from test data because they do not have 14 full days of data collection'.format(perc_data_removed))
number of unique sign up dates for all data collection dates: 17
number of unique sign up dates for participants that have a full 14 days of collection data: 10
43.26% of users were removed from test data because they do not have 14 full days of data collection
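The 2020-12-16 cutoff used in the query above is hardcoded. It can also be derived from the last collection date, assuming the requirement is that 14 full days fit between sign-up and the end of data collection; the date literal below mirrors the max event date printed earlier rather than being recomputed from the data:

```python
import pandas as pd

# last event date reported earlier in the notebook
last_collection_date = pd.Timestamp('2020-12-30')

# a sign-up qualifies if sign_up_date + 14 days <= last_collection_date
cutoff = last_collection_date - pd.Timedelta(days=14)
print(cutoff.date())
```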
# determine which region the test participants are in
print('number of test participants per region per test:')
region_test_breakdown_14_days_data = test_participants_14_days_data.groupby(['ab_test', 'group'])['region'].value_counts()
region_test_breakdown_14_days_data
number of test participants per region per test:
ab_test group region
interface_eu_test A EU 2922
B EU 2790
recommender_system_test A EU 969
N.America 48
CIS 13
APAC 11
B EU 450
N.America 14
APAC 11
CIS 7
Name: region, dtype: int64
Conclusion: Does the data cover the appropriate number of days?
- 43.26% of users were removed from the test data because they signed up too late to have 14 full days of collected data
- `interface_eu_test` is still about evenly split: 2922 and 2790 participants respectively
test_participants_14_days_data.head()
| user_id | group | ab_test | sign_up_date | region | device | |
|---|---|---|---|---|---|---|
| 0 | D1ABA3E2887B6A73 | A | recommender_system_test | 2020-12-07 | EU | PC |
| 3 | 7473E0943673C09E | A | recommender_system_test | 2020-12-16 | EU | iPhone |
| 5 | 92CB588012C10D3D | A | recommender_system_test | 2020-12-09 | EU | PC |
| 6 | 66FC298441D50783 | A | recommender_system_test | 2020-12-08 | EU | iPhone |
| 15 | E0B12CEF93FA04B1 | A | recommender_system_test | 2020-12-15 | EU | Android |
test_participants_14_days_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7235 entries, 0 to 12750
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   user_id       7235 non-null   object
 1   group         7235 non-null   object
 2   ab_test       7235 non-null   object
 3   sign_up_date  7235 non-null   datetime64[ns]
 4   region        7235 non-null   object
 5   device        7235 non-null   object
dtypes: datetime64[ns](1), object(5)
memory usage: 395.7+ KB
Study conversion at different funnel stages
Filter events for users that have 14 days' worth of data
# create a list of the users with 14 days worth of data collection
test_participants_14_days_data_user_list = test_participants_14_days_data['user_id'].unique()
print('number of users with 14 days of data: {}; expect: {}'.format(len(test_participants_14_days_data_user_list),
len(test_participants_14_days_data)))
number of users with 14 days of data: 7235; expect: 7235
# review number of entries in user_events table
num_user_event_entries = len(user_events)
print('number of entries in `user_events` table for all data collection:', num_user_event_entries)
# review number of users in user_event table
num_unique_users_in_events_table = user_events['user_id'].nunique()
print('number of unique users in `user_events` table:', num_unique_users_in_events_table)
# use list of users with 14 days of data to create a table with their event information
user_events_14_days_data = user_events.query('user_id in @test_participants_14_days_data_user_list')
# calculate number of entries in new table
num_user_event_entries_14_days_data = len(user_events_14_days_data)
print('number of user event entries for users that have 14 days of data collection:', num_user_event_entries_14_days_data)
# calculate the number of users left
num_unique_users_left_from_events = user_events_14_days_data['user_id'].nunique()
# calculate percentage of entries removed
perc_event_entries_removed = 1 - num_user_event_entries_14_days_data / num_user_event_entries
print('{:.2%} of user event entries were removed because the users do not have 14 days worth of data collected'.format(perc_event_entries_removed))
# calculate percentage of users remaining
perc_users_left_from_events = num_unique_users_left_from_events / num_unique_users_in_events_table
print('{:.2%} of users remain in the events table after removing those without 14 full days of data collection'.format(perc_users_left_from_events))
number of entries in `user_events` table for all data collection: 423761
number of unique users in `user_events` table: 58703
number of user event entries for users that have 14 days of data collection: 56428
86.68% of user event entries were removed because the users do not have 14 days worth of data collected
12.32% of users remain in the events table after removing those without 14 full days of data collection
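As an aside, `Series.isin` is an equivalent way to express the `query('user_id in @...')` filter used above; a sketch with toy data:

```python
import pandas as pd

# toy events table; the notebook's version has user_id / event columns too
events = pd.DataFrame({'user_id': ['a', 'b', 'c', 'a'],
                       'event_name': ['login'] * 4})
keep = ['a', 'c']  # stand-in for the 14-day user list

# boolean-mask filter equivalent to query('user_id in @keep')
filtered = events[events['user_id'].isin(keep)]
print(len(filtered))  # 3
```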
Merge the events and test participant tables
user_events_14_days_data.head(2)
| user_id | event_datetime | event_name | purchase_usd | |
|---|---|---|---|---|
| 3 | 96F27A054B191457 | 2020-12-07 04:02:40 | purchase | 4.99 |
| 5 | 831887FE7F2D6CBA | 2020-12-07 06:50:29 | purchase | 4.99 |
user_events_14_days_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56428 entries, 3 to 423759
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   user_id         56428 non-null  object
 1   event_datetime  56428 non-null  datetime64[ns]
 2   event_name      56428 non-null  object
 3   purchase_usd    56428 non-null  float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 2.2+ MB
test_participants_14_days_data.head(2)
| user_id | group | ab_test | sign_up_date | region | device | |
|---|---|---|---|---|---|---|
| 0 | D1ABA3E2887B6A73 | A | recommender_system_test | 2020-12-07 | EU | PC |
| 3 | 7473E0943673C09E | A | recommender_system_test | 2020-12-16 | EU | iPhone |
test_participants_14_days_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7235 entries, 0 to 12750
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   user_id       7235 non-null   object
 1   group         7235 non-null   object
 2   ab_test       7235 non-null   object
 3   sign_up_date  7235 non-null   datetime64[ns]
 4   region        7235 non-null   object
 5   device        7235 non-null   object
dtypes: datetime64[ns](1), object(5)
memory usage: 395.7+ KB
# merge the user events and test participant data
test_user_events_14_days = pd.merge(user_events_14_days_data,
test_participants_14_days_data,
on = 'user_id')
# review the data
test_user_events_14_days.head()
| user_id | event_datetime | event_name | purchase_usd | group | ab_test | sign_up_date | region | device | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 96F27A054B191457 | 2020-12-07 04:02:40 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 1 | 96F27A054B191457 | 2020-12-08 09:43:14 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 2 | 96F27A054B191457 | 2020-12-09 00:44:10 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 3 | 96F27A054B191457 | 2020-12-26 00:33:57 | purchase | 9.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 4 | 96F27A054B191457 | 2020-12-07 04:02:41 | product_page | 0.00 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
# examine the general info
test_user_events_14_days.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56428 entries, 0 to 56427
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   user_id         56428 non-null  object
 1   event_datetime  56428 non-null  datetime64[ns]
 2   event_name      56428 non-null  object
 3   purchase_usd    56428 non-null  float64
 4   group           56428 non-null  object
 5   ab_test         56428 non-null  object
 6   sign_up_date    56428 non-null  datetime64[ns]
 7   region          56428 non-null  object
 8   device          56428 non-null  object
dtypes: datetime64[ns](2), float64(1), object(6)
memory usage: 4.3+ MB
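A quick way to confirm the merge dropped nothing is `indicator=True`, which flags whether each row matched in both tables; a sketch with toy frames (not the notebook's data):

```python
import pandas as pd

# toy stand-ins for user_events_14_days_data and test_participants_14_days_data
left = pd.DataFrame({'user_id': ['u1', 'u2'], 'event_name': ['login', 'purchase']})
right = pd.DataFrame({'user_id': ['u1', 'u2'], 'group': ['A', 'B']})

# outer join with _merge column: 'both', 'left_only', or 'right_only'
merged = left.merge(right, on='user_id', how='outer', indicator=True)
unmatched = int((merged['_merge'] != 'both').sum())
print(unmatched)  # 0 means every row matched in both tables
```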
Examine the number of events
# create a table to examine the user funnel for all of the data
all_occurance_funnel = test_user_events_14_days.groupby('event_name')['user_id'].count().reset_index()
# rename the columns
all_occurance_funnel.columns = ['event_name', 'event_occurance']
all_occurance_funnel.sort_values(by = 'event_occurance', ascending = False)
| event_name | event_occurance | |
|---|---|---|
| 0 | login | 24431 |
| 2 | product_page | 15801 |
| 3 | purchase | 8153 |
| 1 | product_cart | 8043 |
# plot a graph of the number of users for each event type
all_occurance_funnel_graph = all_occurance_funnel.plot(kind = 'bar')
plt.title('Number of times an event occurs')
all_occurance_funnel_graph.set_xticklabels(all_occurance_funnel['event_name'], rotation = 45, ha = 'right')
plt.xlabel('event type')
plt.ylabel('number of events')
# label bars with data
for bar in all_occurance_funnel_graph.patches:
all_occurance_funnel_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
Conclusion: Examine the number of events
- the funnel described in the project is product_page → product_card → purchase
- the events present in the data are login, product_page, product_cart, and purchase
- product_cart has fewer events than purchase, suggesting that it is an optional stage in the funnel
- there is a login stage that is not stated in the description, which has the most events per stage
- the likely order of events is login → product_page → product_cart (optional) → purchase

Since product_cart is likely optional, I will remove it from further analysis
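The assumed order could also be sanity-checked by comparing each event type's first timestamp; a sketch with toy data (hypothetical times, one user):

```python
import pandas as pd

# toy events for a single user with hypothetical timestamps
events = pd.DataFrame({
    'user_id': ['u1'] * 3,
    'event_name': ['login', 'product_page', 'purchase'],
    'event_datetime': pd.to_datetime(
        ['2020-12-07 10:00', '2020-12-07 10:05', '2020-12-07 10:20'])})

# earliest occurrence per event type, sorted chronologically
first_seen = events.groupby('event_name')['event_datetime'].min().sort_values()
print(list(first_seen.index))  # ['login', 'product_page', 'purchase']
```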
Remove the optional event
# all_occurance_funnel_updated = all_occurance_funnel.query('event_name != "product_cart"')
# all_occurance_funnel_updated
# # add a percentage of users and percent change columns
# all_occurance_funnel_updated['perc_occurance_per_event'] = round(all_occurance_funnel_updated['event_occurance'] / len(test_user_events_14_days) * 100, 0)
# all_occurance_funnel_updated['perc_change_per_event'] = round(all_occurance_funnel_updated['event_occurance'].sort_values(ascending = False).pct_change(periods = 1) *100, 0)
# all_occurance_funnel_updated = all_occurance_funnel_updated.fillna(0)
# all_occurance_funnel_updated.sort_values(by = 'event_occurance', ascending = False)
# # plot a funnel chart of all the data
# all_occurance_funnel_updated_chart = go.Figure(go.Funnel(y = list(all_occurance_funnel_updated['event_name']),
# x = list(all_occurance_funnel_updated['event_occurance'])))
# all_occurance_funnel_updated_chart.update_layout(title = 'Funnel of all event occurances')
# all_occurance_funnel_updated_chart.show()
##### V2 ########
# add a percentage of users and percent change columns
all_occurance_funnel['perc_occurance_per_event'] = round(all_occurance_funnel['event_occurance'] / len(test_user_events_14_days) * 100, 0)
all_occurance_funnel['perc_change_per_event'] = round(all_occurance_funnel['event_occurance'].sort_values(ascending = False).pct_change(periods = 1) *100, 0)
all_occurance_funnel = all_occurance_funnel.fillna(0)
all_occurance_funnel = all_occurance_funnel.sort_values(by = 'event_occurance', ascending = False).reset_index(drop = True)
all_occurance_funnel
| event_name | event_occurance | perc_occurance_per_event | perc_change_per_event | |
|---|---|---|---|---|
| 0 | login | 24431 | 43.0 | 0.0 |
| 1 | product_page | 15801 | 28.0 | -35.0 |
| 2 | purchase | 8153 | 14.0 | -48.0 |
| 3 | product_cart | 8043 | 14.0 | -1.0 |
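The `perc_change_per_event` column above comes from `pct_change` on the descending-sorted counts; a minimal sketch with the three main stages (numbers taken from the table above):

```python
import pandas as pd

# event counts sorted in descending funnel order
counts = pd.Series([24431, 15801, 8153],
                   index=['login', 'product_page', 'purchase'])

# percent change from the previous (larger) stage; the first stage is NaN
drop = (counts.pct_change() * 100).round(0)
print(drop['product_page'])  # -35.0
```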
##### V2 ########
all_occurance_funnel_chart = go.Figure(go.Funnel(y = list(all_occurance_funnel['event_name']),
x = list(all_occurance_funnel['event_occurance'])))
all_occurance_funnel_chart.update_layout(title = 'Funnel of all event occurrences')
all_occurance_funnel_chart.show()
Create funnels for each group
# create queries for each group without optional stage
group_a_query = test_user_events_14_days.query('group == "A" & event_name != "product_cart"')
group_b_query = test_user_events_14_days.query('group == "B" & event_name != "product_cart"')
###### V2 #########
# create queries for each group with optional stage
group_a_query = test_user_events_14_days.query('group == "A"')
group_b_query = test_user_events_14_days.query('group == "B"')
# create event funnel tables for each group
group_a_funnel_mix_test = group_a_query.groupby('event_name').agg({'user_id': ['count', 'nunique']}).reset_index()
group_a_funnel_mix_test.columns = ['event_name', 'group_a_events_mix', 'unique_a_users_mix']
group_a_funnel_mix_test = group_a_funnel_mix_test.sort_values(by = 'group_a_events_mix', ascending = False).reset_index(drop = True)
display(group_a_funnel_mix_test)
group_b_funnel_mix_test = group_b_query.groupby('event_name').agg({'user_id': ['count', 'nunique']}).reset_index()
group_b_funnel_mix_test.columns = ['event_name', 'group_b_events_mix', 'unique_b_users_mix']
group_b_funnel_mix_test = group_b_funnel_mix_test.sort_values(by = 'group_b_events_mix', ascending = False).reset_index(drop = True)
display(group_b_funnel_mix_test)
| event_name | group_a_events_mix | unique_a_users_mix | |
|---|---|---|---|
| 0 | login | 13410 | 3962 |
| 1 | product_page | 8823 | 2619 |
| 2 | purchase | 4573 | 1350 |
| 3 | product_cart | 4344 | 1274 |
| event_name | group_b_events_mix | unique_b_users_mix | |
|---|---|---|---|
| 0 | login | 11021 | 3271 |
| 1 | product_page | 6978 | 2093 |
| 2 | product_cart | 3699 | 1087 |
| 3 | purchase | 3580 | 1067 |
# Create queries of each test group with `interface_eu_test` test only
group_a_eu_query = group_a_query.query('ab_test == "interface_eu_test"')
group_b_eu_query = group_b_query.query('ab_test == "interface_eu_test"')
# create event funnels for `interface_eu_test` participants
group_a_funnel_eu_test = group_a_eu_query.groupby('event_name').agg({'user_id': ['count', 'nunique']}).reset_index()
group_a_funnel_eu_test.columns = ['event_name', 'group_a_events_eu', 'unique_a_users_eu']
group_a_funnel_eu_test = group_a_funnel_eu_test.sort_values(by = 'group_a_events_eu', ascending = False).reset_index(drop = True)
display(group_a_funnel_eu_test)
group_b_funnel_eu_test = group_b_eu_query.groupby('event_name').agg({'user_id': ['count', 'nunique']}).reset_index()
group_b_funnel_eu_test.columns = ['event_name', 'group_b_events_eu', 'unique_b_users_eu']
group_b_funnel_eu_test = group_b_funnel_eu_test.sort_values(by = 'group_b_events_eu', ascending = False).reset_index(drop = True)
display(group_b_funnel_eu_test)
| event_name | group_a_events_eu | unique_a_users_eu | |
|---|---|---|---|
| 0 | login | 10111 | 2921 |
| 1 | product_page | 6701 | 1945 |
| 2 | purchase | 3636 | 1040 |
| 3 | product_cart | 3349 | 959 |
| event_name | group_b_events_eu | unique_b_users_eu | |
|---|---|---|---|
| 0 | login | 9612 | 2790 |
| 1 | product_page | 6213 | 1820 |
| 2 | product_cart | 3305 | 954 |
| 3 | purchase | 3226 | 938 |
# merge the tables
group_a_funnels = pd.merge(group_a_funnel_mix_test, group_a_funnel_eu_test, on = 'event_name')
group_b_funnels = pd.merge(group_b_funnel_mix_test, group_b_funnel_eu_test, on = 'event_name')
display(group_a_funnels)
display(group_b_funnels)
| event_name | group_a_events_mix | unique_a_users_mix | group_a_events_eu | unique_a_users_eu | |
|---|---|---|---|---|---|
| 0 | login | 13410 | 3962 | 10111 | 2921 |
| 1 | product_page | 8823 | 2619 | 6701 | 1945 |
| 2 | purchase | 4573 | 1350 | 3636 | 1040 |
| 3 | product_cart | 4344 | 1274 | 3349 | 959 |
| event_name | group_b_events_mix | unique_b_users_mix | group_b_events_eu | unique_b_users_eu | |
|---|---|---|---|---|---|
| 0 | login | 11021 | 3271 | 9612 | 2790 |
| 1 | product_page | 6978 | 2093 | 6213 | 1820 |
| 2 | product_cart | 3699 | 1087 | 3305 | 954 |
| 3 | purchase | 3580 | 1067 | 3226 | 938 |
# calculate the percentage of users per stage and percentage of change per stage
group_a_funnels['perc_a_users_mix'] = round(group_a_funnels['unique_a_users_mix'] / group_a_query['user_id'].nunique() * 100, 0)
group_a_funnels['perc_a_change_mix'] = round(group_a_funnels['unique_a_users_mix'].pct_change(periods = 1) * 100, 0)
group_a_funnels['perc_a_users_eu'] = round(group_a_funnels['unique_a_users_eu'] / group_a_eu_query['user_id'].nunique() * 100, 0)
group_a_funnels['perc_a_change_eu'] = round(group_a_funnels['unique_a_users_eu'].pct_change(periods = 1) * 100, 0)
group_a_funnels = group_a_funnels.fillna(0)
display(group_a_funnels)
group_b_funnels['perc_b_users_mix'] = round(group_b_funnels['unique_b_users_mix'] / group_b_query['user_id'].nunique() * 100, 0)
group_b_funnels['perc_b_change_mix'] = round(group_b_funnels['unique_b_users_mix'].pct_change(periods = 1) * 100, 0)
group_b_funnels['perc_b_users_eu'] = round(group_b_funnels['unique_b_users_eu'] / group_b_eu_query['user_id'].nunique() * 100, 0)
group_b_funnels['perc_b_change_eu'] = round(group_b_funnels['unique_b_users_eu'].pct_change(periods = 1) * 100, 0)
group_b_funnels = group_b_funnels.fillna(0)
display(group_b_funnels)
| event_name | group_a_events_mix | unique_a_users_mix | group_a_events_eu | unique_a_users_eu | perc_a_users_mix | perc_a_change_mix | perc_a_users_eu | perc_a_change_eu | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | login | 13410 | 3962 | 10111 | 2921 | 100.0 | 0.0 | 100.0 | 0.0 |
| 1 | product_page | 8823 | 2619 | 6701 | 1945 | 66.0 | -34.0 | 67.0 | -33.0 |
| 2 | purchase | 4573 | 1350 | 3636 | 1040 | 34.0 | -48.0 | 36.0 | -47.0 |
| 3 | product_cart | 4344 | 1274 | 3349 | 959 | 32.0 | -6.0 | 33.0 | -8.0 |
| event_name | group_b_events_mix | unique_b_users_mix | group_b_events_eu | unique_b_users_eu | perc_b_users_mix | perc_b_change_mix | perc_b_users_eu | perc_b_change_eu | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | login | 11021 | 3271 | 9612 | 2790 | 100.0 | 0.0 | 100.0 | 0.0 |
| 1 | product_page | 6978 | 2093 | 6213 | 1820 | 64.0 | -36.0 | 65.0 | -35.0 |
| 2 | product_cart | 3699 | 1087 | 3305 | 954 | 33.0 | -48.0 | 34.0 | -48.0 |
| 3 | purchase | 3580 | 1067 | 3226 | 938 | 33.0 | -2.0 | 34.0 | -2.0 |
Graph the event funnel for the EU test data
# plot a funnel chart for the EU test data
group_a_funnels_chart = go.Figure(go.Funnel(y = list(group_a_funnels['event_name']),
x = list(group_a_funnels['unique_a_users_eu'])))
group_a_funnels_chart.update_layout(title = 'Event Funnel: Group A `interface_eu_test` participants')
group_a_funnels_chart.show()
# plot a funnel chart for the EU test data
group_b_funnels_chart = go.Figure(go.Funnel(y = list(group_b_funnels['event_name']),
x = list(group_b_funnels['unique_b_users_eu'])))
group_b_funnels_chart.update_layout(title = 'Event Funnel: Group B `interface_eu_test` participants')
group_b_funnels_chart.show()
Conclusion: Create funnels for each group
fig = go.Figure()
fig.add_trace(go.Funnel(name = 'Group A',
y = list(group_a_funnels['event_name']),
x = list(group_a_funnels['unique_a_users_eu'])))
fig.add_trace(go.Funnel(name = 'Group B',
y = list(group_b_funnels['event_name']),
x = list(group_b_funnels['unique_b_users_eu'])))
fig.show()
# merge group a and group b funnel tables
funnels = pd.merge(group_a_funnels, group_b_funnels, on = 'event_name')
funnels
| event_name | group_a_events_mix | unique_a_users_mix | group_a_events_eu | unique_a_users_eu | perc_a_users_mix | perc_a_change_mix | perc_a_users_eu | perc_a_change_eu | group_b_events_mix | unique_b_users_mix | group_b_events_eu | unique_b_users_eu | perc_b_users_mix | perc_b_change_mix | perc_b_users_eu | perc_b_change_eu | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | login | 13410 | 3962 | 10111 | 2921 | 100.0 | 0.0 | 100.0 | 0.0 | 11021 | 3271 | 9612 | 2790 | 100.0 | 0.0 | 100.0 | 0.0 |
| 1 | product_page | 8823 | 2619 | 6701 | 1945 | 66.0 | -34.0 | 67.0 | -33.0 | 6978 | 2093 | 6213 | 1820 | 64.0 | -36.0 | 65.0 | -35.0 |
| 2 | purchase | 4573 | 1350 | 3636 | 1040 | 34.0 | -48.0 | 36.0 | -47.0 | 3580 | 1067 | 3226 | 938 | 33.0 | -2.0 | 34.0 | -2.0 |
| 3 | product_cart | 4344 | 1274 | 3349 | 959 | 32.0 | -6.0 | 33.0 | -8.0 | 3699 | 1087 | 3305 | 954 | 33.0 | -48.0 | 34.0 | -48.0 |
# graph the percent conversion from one stage to the next
conversion_graph = funnels[['perc_a_users_eu', 'perc_b_users_eu']].plot(kind = 'bar')
plt.title('Percent conversion from one stage to the next')
conversion_graph.set_xticklabels(funnels['event_name'], rotation = 45, ha = 'right')
plt.xlabel('event stage')
plt.ylabel('percent')
# label bars with data
for bar in conversion_graph.patches:
conversion_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
Is the number of events per user distributed equally in the samples?
display(group_a_funnel_eu_test)
display(group_b_funnel_eu_test)
| event_name | group_a_events_eu | unique_a_users_eu | |
|---|---|---|---|
| 0 | login | 10111 | 2921 |
| 1 | product_page | 6701 | 1945 |
| 2 | purchase | 3636 | 1040 |
| 3 | product_cart | 3349 | 959 |
| event_name | group_b_events_eu | unique_b_users_eu | |
|---|---|---|---|
| 0 | login | 9612 | 2790 |
| 1 | product_page | 6213 | 1820 |
| 2 | product_cart | 3305 | 954 |
| 3 | purchase | 3226 | 938 |
# merge the funnel tables
event_funnel_table = pd.merge(group_a_funnel_eu_test, group_b_funnel_eu_test, on = 'event_name')
event_funnel_table['avg_a_events_per_user'] = event_funnel_table['group_a_events_eu'] / event_funnel_table['unique_a_users_eu']
event_funnel_table['avg_b_events_per_user'] = event_funnel_table['group_b_events_eu'] / event_funnel_table['unique_b_users_eu']
event_funnel_table['perc_diff_btwn_avg_events_per_user'] = event_funnel_table['avg_b_events_per_user'] / event_funnel_table['avg_a_events_per_user']
# event_funnel_table['perc_event_diff_btwn_a_b'] = 100 - event_funnel_table['group_b_events_eu'] / event_funnel_table['group_a_events_eu'] * 100
# event_funnel_table['perc_user_diff_btwn_a_b'] = 100 - event_funnel_table['unique_b_users_eu'] / event_funnel_table['unique_a_users_eu'] * 100
event_funnel_table
| event_name | group_a_events_eu | unique_a_users_eu | group_b_events_eu | unique_b_users_eu | avg_a_events_per_user | avg_b_events_per_user | perc_diff_btwn_avg_events_per_user | |
|---|---|---|---|---|---|---|---|---|
| 0 | login | 10111 | 2921 | 9612 | 2790 | 3.461486 | 3.445161 | 0.995284 |
| 1 | product_page | 6701 | 1945 | 6213 | 1820 | 3.445244 | 3.413736 | 0.990855 |
| 2 | purchase | 3636 | 1040 | 3226 | 938 | 3.496154 | 3.439232 | 0.983719 |
| 3 | product_cart | 3349 | 959 | 3305 | 954 | 3.492179 | 3.464361 | 0.992034 |
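Note that `perc_diff_btwn_avg_events_per_user` above is actually a ratio (B divided by A), not a percentage; a small sketch (hypothetical value from the login row) converting it to a signed percent difference:

```python
# ratio of group B's average events per user to group A's (login row above)
ratio = 0.995284

# signed percent difference: negative means B averages fewer events than A
pct_diff = (ratio - 1) * 100
print(round(pct_diff, 2))  # -0.47
```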
event_funnel_table_graph = event_funnel_table[['avg_a_events_per_user', 'avg_b_events_per_user']].plot(kind = 'bar')
plt.title('Average number of events per user for sample groups A and B')
event_funnel_table_graph.set_xticklabels(event_funnel_table['event_name'], rotation = 45, ha = 'right')
plt.xlabel('event type')
plt.ylabel('average number of events')
# label bars with data
for bar in event_funnel_table_graph.patches:
event_funnel_table_graph.annotate(format(bar.get_height(), '.2f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
Conclusion: Is the number of events per user distributed equally in the samples?
Are there users who enter both samples?
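One way to answer this is to count distinct groups per user; a minimal sketch with toy data (the notebook would group `test_participants_14_days_data` instead):

```python
import pandas as pd

# toy participants table: u1 was assigned to both A and B
participants = pd.DataFrame({'user_id': ['u1', 'u1', 'u2'],
                             'group': ['A', 'B', 'A']})

# users with more than one distinct group entered both samples
groups_per_user = participants.groupby('user_id')['group'].nunique()
overlap = groups_per_user[groups_per_user > 1]
print(len(overlap))  # 1
```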
How is the number of events distributed by days?
test_user_events_14_days.head()
| user_id | event_datetime | event_name | purchase_usd | group | ab_test | sign_up_date | region | device | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 96F27A054B191457 | 2020-12-07 04:02:40 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 1 | 96F27A054B191457 | 2020-12-08 09:43:14 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 2 | 96F27A054B191457 | 2020-12-09 00:44:10 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 3 | 96F27A054B191457 | 2020-12-26 00:33:57 | purchase | 9.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
| 4 | 96F27A054B191457 | 2020-12-07 04:02:41 | product_page | 0.00 | B | interface_eu_test | 2020-12-07 | EU | iPhone |
# test_users_eu = test_user_events_14_days.query('event_name != "product_cart" & ab_test == "interface_eu_test"')
# test_users_eu.info()
# # query_test.info()
####### V2 #########
# returning product_cart to analysis
test_users_eu = test_user_events_14_days.query('ab_test == "interface_eu_test"').copy()
test_users_eu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46153 entries, 0 to 56427
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   user_id         46153 non-null  object
 1   event_datetime  46153 non-null  datetime64[ns]
 2   event_name      46153 non-null  object
 3   purchase_usd    46153 non-null  float64
 4   group           46153 non-null  object
 5   ab_test         46153 non-null  object
 6   sign_up_date    46153 non-null  datetime64[ns]
 7   region          46153 non-null  object
 8   device          46153 non-null  object
dtypes: datetime64[ns](2), float64(1), object(6)
memory usage: 3.5+ MB
test_users_eu['event_date'] = test_users_eu['event_datetime'].dt.date
test_users_eu.head()
| user_id | event_datetime | event_name | purchase_usd | group | ab_test | sign_up_date | region | device | event_date | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 96F27A054B191457 | 2020-12-07 04:02:40 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone | 2020-12-07 |
| 1 | 96F27A054B191457 | 2020-12-08 09:43:14 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone | 2020-12-08 |
| 2 | 96F27A054B191457 | 2020-12-09 00:44:10 | purchase | 4.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone | 2020-12-09 |
| 3 | 96F27A054B191457 | 2020-12-26 00:33:57 | purchase | 9.99 | B | interface_eu_test | 2020-12-07 | EU | iPhone | 2020-12-26 |
| 4 | 96F27A054B191457 | 2020-12-07 04:02:41 | product_page | 0.00 | B | interface_eu_test | 2020-12-07 | EU | iPhone | 2020-12-07 |
test_users_eu_pivot = test_users_eu.pivot_table(columns = 'group', index = 'event_date',
values = 'event_name', aggfunc = 'count').reset_index()
test_users_eu_pivot.columns = ['event_date', 'A', 'B']
test_users_eu_pivot
| event_date | A | B | |
|---|---|---|---|
| 0 | 2020-12-07 | 955 | 928 |
| 1 | 2020-12-08 | 1070 | 1098 |
| 2 | 2020-12-09 | 1084 | 1072 |
| 3 | 2020-12-10 | 1204 | 1195 |
| 4 | 2020-12-11 | 1231 | 1234 |
| 5 | 2020-12-12 | 1753 | 1607 |
| 6 | 2020-12-13 | 1948 | 1890 |
| 7 | 2020-12-14 | 2340 | 2128 |
| 8 | 2020-12-15 | 2052 | 1973 |
| 9 | 2020-12-16 | 1941 | 1722 |
| 10 | 2020-12-17 | 1324 | 1230 |
| 11 | 2020-12-18 | 1066 | 906 |
| 12 | 2020-12-19 | 923 | 788 |
| 13 | 2020-12-20 | 848 | 660 |
| 14 | 2020-12-21 | 694 | 686 |
| 15 | 2020-12-22 | 654 | 576 |
| 16 | 2020-12-23 | 566 | 555 |
| 17 | 2020-12-24 | 551 | 534 |
| 18 | 2020-12-26 | 405 | 444 |
| 19 | 2020-12-27 | 478 | 391 |
| 20 | 2020-12-28 | 379 | 416 |
| 21 | 2020-12-29 | 319 | 321 |
| 22 | 2020-12-30 | 12 | 2 |
test_users_eu_pivot_graph = test_users_eu_pivot.plot(figsize = (12, 6))
plt.title('Number of events per day per group (EU test)')
plt.xticks(np.arange(test_users_eu_pivot.shape[0])[::1])
test_users_eu_pivot_graph.set_xticklabels(list(test_users_eu_pivot['event_date']), rotation = 45, ha = 'right')
plt.xlabel('event date')
plt.ylabel('number of events')
plt.show()
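The sharp drop on the last day in the table above (12 and 2 events) suggests a partially collected day; a sketch of a simple flag for such days, with toy data and a hypothetical threshold of 10% of the median:

```python
import pandas as pd

# toy daily event counts: a normal day and a suspiciously small last day
daily = pd.Series({'2020-12-29': 640, '2020-12-30': 14})

# flag days with far fewer events than typical (threshold is an assumption)
suspect = daily[daily < daily.median() * 0.1]
print(list(suspect.index))  # ['2020-12-30']
```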
Conclusion: How is the number of events distributed by days?
What details in the data do you have to take into account before starting the A/B test?
# determine the difference between the number of unique users in each group
perc_diff_btwn_groups_eu = 1 - group_b_eu_query['user_id'].nunique() / group_a_eu_query['user_id'].nunique()
print('There is a {:.2%} difference in the number of unique users per group for the `interface_eu_test`'.format(perc_diff_btwn_groups_eu))
perc_diff_btwn_groups_mix = 1 - group_b_query['user_id'].nunique() / group_a_query['user_id'].nunique()
print('There is a {:.2%} difference in the number of unique users per group for the mix of `interface_eu_test` with `recommender_system_test`'.format(perc_diff_btwn_groups_mix))
There is a 4.52% difference in the number of unique users per group for the `interface_eu_test`
There is a 17.44% difference in the number of unique users per group for the mix of `interface_eu_test` with `recommender_system_test`
The difference in the number of unique users per group was calculated for:
- `interface_eu_test`
- `interface_eu_test` with `recommender_system_test`

Actions Performed
- confirmed the test participants with 14 full days of data belong to `interface_eu_test` and that the two groups fall within 10% of each other
- Audience: 15% of the new users from the EU region - confirmed that the users in the test data are from the EU region
- `interface_eu_test` participants are from the EU region
- `recommender_system_test` contains users from several regions
- `interface_eu_test` participants are new users; 15% is expected
- `interface_eu_test` is still about evenly split: 2922 and 2790 respectively
- number of entries in the `user_events` table for all data collection: 423761
- number of unique users in the `user_events` table: 58703
- the funnel described in the project is product_page → product_card → purchase
- the events present in the data are login, product_page, product_cart, and purchase
- product_cart has fewer events than purchase, suggesting that it is an optional stage in the funnel
- there is a login stage that is not stated in the description, which has the most events per stage
- the likely order of events is login → product_page → product_cart (optional) → purchase
- since product_cart is likely optional, it was removed from the analysis, then returned in V2

# create a df with the number of users per event per test group
# merge on event_name so the group A and group B rows align (the two funnels sort their stages differently)
exp_funnel_table = pd.merge(group_a_funnels[['event_name', 'unique_a_users_mix', 'unique_a_users_eu']],
                            group_b_funnels[['event_name', 'unique_b_users_mix', 'unique_b_users_eu']],
                            on = 'event_name')
exp_funnel_table.columns = ['event_name', 'group_a_mix', 'group_a_eu', 'group_b_mix', 'group_b_eu']
# df_test1
exp_funnel_table
| event_name | group_a_mix | group_a_eu | group_b_mix | group_b_eu | |
|---|---|---|---|---|---|
| 0 | login | 3962 | 2921 | 3271 | 2790 |
| 1 | product_page | 2619 | 1945 | 2093 | 1820 |
| 2 | purchase | 1350 | 1040 | 1067 | 938 |
| 3 | product_cart | 1274 | 959 | 1087 | 954 |
# make a table with total number of users per test group
# transpose the table above with the row of interest only
users_per_test = exp_funnel_table.head(1).T
# filter out the first row, which contains the column name and value of the untransposed table/row
users_per_test = users_per_test.iloc[1:, :].reset_index()
users_per_test.columns = ['test', 'num_users']
# rename the values in the rows
users_per_test['test'] = users_per_test['test'].replace({'unique_a_users_mix': 'group_a_mix',
'unique_a_users_eu': 'group_a_eu',
'unique_b_users_mix': 'group_b_mix',
'unique_b_users_eu': 'group_b_eu'})
# total_test
users_per_test
| test | num_users | |
|---|---|---|
| 0 | group_a_mix | 3962 |
| 1 | group_a_eu | 2921 |
| 2 | group_b_mix | 3271 |
| 3 | group_b_eu | 2790 |
Test Hypothesis
# create a function for testing the hypotheses
def test_hypothesis(exp_a, exp_b, alpha):
    # iterate over each event
    for event in exp_funnel_table['event_name']:
        # define the successes, i.e. the number of users who performed the stage
        successes_a = exp_funnel_table[exp_funnel_table['event_name'] == event][exp_a].iloc[0]
        successes_b = exp_funnel_table[exp_funnel_table['event_name'] == event][exp_b].iloc[0]
        # define the trials, i.e. the number of users in the experiment group
        exp_group_a = users_per_test[users_per_test['test'] == exp_a]['num_users'].iloc[0]
        exp_group_b = users_per_test[users_per_test['test'] == exp_b]['num_users'].iloc[0]
        # proportion of success in experiment a
        p1 = successes_a / exp_group_a
        # proportion of success in experiment b
        p2 = successes_b / exp_group_b
        # success proportion in the combined dataset
        p_combined = (successes_a + successes_b) / (exp_group_a + exp_group_b)
        # difference between the proportions
        difference = p1 - p2
        # calculate the test statistic, which is distributed as N(0, 1) under the null hypothesis
        # note: for the login stage p1 = p2 = 1, so z is 0/0 and the p-value is nan
        z = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/exp_group_a + 1/exp_group_b))
        # set up the standard normal distribution (mean 0, standard deviation 1)
        distr = st.norm(0, 1)
        # calculate the two-sided p-value
        p_value = (1 - distr.cdf(abs(z))) * 2
        print('Event: {}:'.format(event))
        print('\tp-value: ', p_value)
        if p_value < alpha:
            print('\tReject the null hypothesis for {}, there is a significant difference between the proportions of experiments {} and {}.'
                  .format(event, exp_a, exp_b))
        else:
            print('\tFail to reject the null hypothesis for {}, there is not a significant difference between the proportions of experiments {} and {}.'
                  .format(event, exp_a, exp_b))
        # print the number of users and share of users per event for each experiment group
        print('\tExperiment:\t\t{}\t{}\n\tNumber of users:\t{}\t{}\n\tShare of users:\t\t{:.2%}\t{:.2%}'
              .format(exp_a, exp_b, exp_group_a, exp_group_b, p1, p2), '\n')
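As a cross-check, the pooled two-proportion z-test above can be written compactly and verified on one stage; using the group sizes and `product_page` user counts from the EU tables above, this sketch recovers the same p-value:

```python
import math
from scipy import stats

def two_prop_z(succ_a, n_a, succ_b, n_b):
    # pooled two-proportion z-test, same formula as test_hypothesis() above
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    z = (succ_a / n_a - succ_b / n_b) / math.sqrt(
        p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (1 - stats.norm.cdf(abs(z))) * 2  # two-sided p-value

# product_page: 1945 of 2921 users in group A, 1820 of 2790 in group B (EU test)
p = two_prop_z(1945, 2921, 1820, 2790)
print(round(p, 4))  # 0.2806, matching the notebook's product_page p-value
```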
# print('Hypothesis test results for the mixed test: `interface_eu_test` and `recommender_system_test`')
# test_hypothesis(exp_a = 'group_a_mix', exp_b = 'group_b_mix', alpha = 0.05)
print('Hypothesis test results for the `interface_eu_test` only')
test_hypothesis(exp_a = 'group_a_eu', exp_b = 'group_b_eu', alpha = 0.05)
Hypothesis test results for the `interface_eu_test` only
Event: login:
	p-value:  nan
	Fail to reject the null hypothesis for login, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
	Experiment:		group_a_eu	group_b_eu
	Number of users:	2921	2790
	Share of users:		100.00%	100.00%
Event: product_page:
	p-value:  0.2805805670722996
	Fail to reject the null hypothesis for product_page, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
	Experiment:		group_a_eu	group_b_eu
	Number of users:	2921	2790
	Share of users:		66.59%	65.23%
Event: purchase:
	p-value:  0.26361431462554874
	Fail to reject the null hypothesis for purchase, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
	Experiment:		group_a_eu	group_b_eu
	Number of users:	2921	2790
	Share of users:		35.60%	34.19%
Event: product_cart:
	p-value:  0.5269331183788686
	Fail to reject the null hypothesis for product_cart, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
	Experiment:		group_a_eu	group_b_eu
	Number of users:	2921	2790
	Share of users:		32.83%	33.62%
Actions Performed
Null Hypothesis: the proportion of users reaching each event is the same in both groups.
Alternative Hypothesis: the proportions of users reaching each event differ between the two groups.
Two tests were performed:
- `interface_eu_test` and `recommender_system_test` mixed
- `interface_eu_test` only
# calculate the FWER
desired_alpha = 0.05
num_tests = 4
fwer = 1 - (1 - desired_alpha) ** num_tests
print('FWER for these experiments is {:.4f} when each test is run at alpha = {}.'.format(fwer, desired_alpha))
print('This means there is a {:.2%} probability of at least one false positive result out of {} tests.'.format(fwer, num_tests))
FWER for these experiments is 0.1855 when each test is run at alpha = 0.05.
This means there is a 18.55% probability of at least one false positive result out of 4 tests.
# # calculate the Bonferroni correction (earlier attempt: incorrectly used the FWER itself as the corrected alpha)
# bonferroni_correction = fwer
# print('Bonferroni correction: {}'.format(bonferroni_correction))
# # rerun the A/B tests with Bonferroni correction
# print('Bonferroni corrected: Hypothesis test results for the `interface_eu_test` only')
# test_hypothesis(exp_a = 'group_a_eu', exp_b = 'group_b_eu', alpha = bonferroni_correction)
###### V3 ##########
bonferroni_correction_v3 = 0.05 / 4
print('alpha with Bonferroni correction: {}'.format(bonferroni_correction_v3))
alpha with Bonferroni correction: 0.0125
###### V3 ##########
# rerun the A/B tests with Bonferroni correction
print('Bonferroni corrected: Hypothesis test results for the `interface_eu_test` only')
test_hypothesis(exp_a = 'group_a_eu', exp_b = 'group_b_eu', alpha = bonferroni_correction_v3)
Bonferroni corrected: Hypothesis test results for the `interface_eu_test` only
Event: login: p-value: nan
    Fail to reject the null hypothesis for login, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
    Experiment:       group_a_eu  group_b_eu
    Number of users:  2921        2790
    Share of users:   100.00%     100.00%
Event: product_page: p-value: 0.2805805670722996
    Fail to reject the null hypothesis for product_page, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
    Experiment:       group_a_eu  group_b_eu
    Number of users:  2921        2790
    Share of users:   66.59%      65.23%
Event: purchase: p-value: 0.26361431462554874
    Fail to reject the null hypothesis for purchase, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
    Experiment:       group_a_eu  group_b_eu
    Number of users:  2921        2790
    Share of users:   35.60%      34.19%
Event: product_cart: p-value: 0.5269331183788686
    Fail to reject the null hypothesis for product_cart, there is not a significant difference between the proportions of experiments group_a_eu and group_b_eu.
    Experiment:       group_a_eu  group_b_eu
    Number of users:  2921        2790
    Share of users:   32.83%      33.62%
# the reviewer's cell:
1 - (1 - 0.05/4) ** 4
0.049070288085937275
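The reviewer's cell evaluates the family-wise error rate at the Bonferroni-corrected alpha. Wrapping both calculations in one helper makes the before/after comparison explicit; this is a small stdlib-only sketch, using the same `num_tests = 4` and `alpha = 0.05` as above:

```python
def fwer(alpha, num_tests):
    """Family-wise error rate: probability of at least one false positive
    across num_tests independent tests, each run at the given alpha."""
    return 1 - (1 - alpha) ** num_tests

num_tests = 4
alpha = 0.05
print('Uncorrected FWER:      {:.4f}'.format(fwer(alpha, num_tests)))             # 0.1855
bonferroni_alpha = alpha / num_tests                                              # 0.0125
print('FWER after Bonferroni: {:.4f}'.format(fwer(bonferroni_alpha, num_tests)))  # 0.0491
```

The corrected FWER of about 0.049 matches the reviewer's cell and confirms that dividing alpha by the number of tests brings the family-wise error rate back under the desired 0.05.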
Project Description and Goal
Review of the Technical description
There are several errors in the original technical description:
- Update the test name: the data correspond to `interface_eu_test`, not `recommender_system_test`. The `recommender_system_test` label appears to be a typo because very few users came from that test group, and it mixes users from different regions, whereas the stated target audience is the EU region, which is where most of the users in the `interface_eu_test` data set are from.
- The funnel should be based on logins (`login` event), product page views (`product_page`) and purchases (`purchase`); at each stage of the funnel `login → product_page → purchase`, there will be at least a 10% increase. The `product_cart` stage does not fit the described funnel `product_page → product_card → purchase` (note the events table records `product_cart`, while the description says `product_card`), so the analysis focuses on `interface_eu_test`.
Project Conclusions
Two sets of hypothesis tests were run:
- `interface_eu_test` and `recommender_system_test` mixed
- `interface_eu_test` only
Recommendations
- Re-run `interface_eu_test` because of the poor data collection methods.
- If a re-evaluation is not desired, then the results show that the new recommendation system is not increasing conversion as expected. It is, however, performing statistically the same as the current system.